# Lab 18 - Central Limit Theorem (CLT)

The Central Limit Theorem says that if we take many random samples of the same size from a population, then the means of those samples are approximately normally distributed, provided the sample size is large enough. The mean of this distribution is close to the mean of the population, and its standard deviation is close to the standard deviation of the population divided by the square root of the sample size. The normal approximation improves as the sample size grows.

We will experimentally test the Central Limit Theorem in this lab. 

Download the CSV file of nutritional information for Starbucks drinks, which is taken from [Kaggle.com](https://www.kaggle.com/starbucks/starbucks-menu). 

First, let's import the necessary libraries.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
```

### Loading and cleaning the data.

Read the CSV file into a dataframe called `drinks`.

Display your `drinks` dataframe below. 

What symbol represents missing data in this data set? Since this is not a standard symbol, Pandas does not recognize it as missing data. 

Because Pandas does not recognize `-` as missing data here, we need to pass this information to the `read_csv` function by adding the parameter `na_values = "-"`. Try reading in the file again below, using this parameter. 

Display your revised dataframe.

How does Pandas represent missing data? 

There is one more problem with how the data was read in. What is the first column?

The drink name is really what identifies each row, so it should be the row name rather than a regular column. We can do this by adding the parameter `index_col = 0`, which tells `read_csv` to use the first column (numbered 0) as the index, that is, the row names. 

Read in the file one more time below adding both parameters. 
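
As a rough illustration of how the two parameters work together, here is a minimal sketch that reads a tiny made-up CSV from memory instead of the downloaded file. The drink names, the column names, and the variable `toy` are assumptions for this example only, not the lab's data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file: three made-up rows,
# with "-" marking missing values as in the lab's data set.
csv_text = """Drink,Calories,Protein
Toy Latte,190,12
Toy Tea,-,-
Toy Mocha,290,13
"""

# na_values="-" tells pandas to treat "-" as missing data;
# index_col=0 uses the first column (the drink names) as the row labels.
toy = pd.read_csv(io.StringIO(csv_text), na_values="-", index_col=0)
print(toy)
```

With your own file, you would pass the file name to `read_csv` in place of the `io.StringIO(...)` object.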

Display your revised dataframe.

Do you notice any other possible issues? 

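Some of the rows have no data (all columns are `NaN`). We can remove these rows from the dataframe with the code `drinks = drinks.dropna(axis = 0)`. Note that, by default, `dropna` removes every row that contains at least one `NaN`, not only the rows that are entirely empty. Try it below and display your new dataframe. 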
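
To see what `dropna` does to partially filled rows, here is a small sketch on a made-up dataframe (the names and values in `toy` are assumptions for illustration). It contrasts the default behavior with `how="all"`, which only drops rows where every column is missing.

```python
import numpy as np
import pandas as pd

# Made-up data: "Toy Tea" is entirely missing, "Toy Cider" is partly missing.
toy = pd.DataFrame(
    {
        "Calories": [190.0, np.nan, 290.0, 120.0],
        "Protein": [12.0, np.nan, 13.0, np.nan],
    },
    index=["Toy Latte", "Toy Tea", "Toy Mocha", "Toy Cider"],
)

# Default (how="any"): drop every row with at least one NaN.
any_missing_dropped = toy.dropna(axis=0)

# how="all": drop only the rows where every column is NaN.
all_missing_dropped = toy.dropna(axis=0, how="all")

print(len(any_missing_dropped), len(all_missing_dropped))
```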

### Testing the Central Limit Theorem: Calories column

Plot a histogram of the calorie column to visualize its distribution.

What do you notice about this distribution?

Now we will take 10,000 random samples of size 30 from the calorie data, compute the mean of each sample, and plot the means as a histogram. We are visualizing the *sampling distribution of the mean* (Lab 11).

The pseudo-code to compute the means and store them in a list is:

create an empty list
repeat 10,000 times:
 take a sample of size 30 from the drinks dataframe
 compute the mean number of calories for this sample
 add this mean to your list
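
One possible translation of this pseudo-code is sketched below. It runs on a synthetic, right-skewed population rather than the real file, and the dataframe `toy`, the column name `Calories`, and the list name `means_30` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for the calorie data (not the real values).
rs = np.random.RandomState(0)
toy = pd.DataFrame({"Calories": rs.gamma(shape=2.0, scale=60.0, size=200)})

means_30 = []                                   # create an empty list
for _ in range(10_000):                         # repeat 10,000 times
    sample = toy.sample(n=30, random_state=rs)  # sample of size 30, without replacement
    means_30.append(sample["Calories"].mean())  # store the mean of this sample

print(len(means_30))
```

Passing the same `RandomState` object to every call keeps the run reproducible while still giving a different sample each time. In the lab you would sample from `drinks` and can leave out `random_state`.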


Write the Python code below.

Plot the means as a histogram. Remember to convert your list into a Pandas series and adjust the number of bins to get a good visualization.

What distribution does this look like?

Now we'll simulate the sampling distribution of the mean again, but using a sample size of 50.

For reference, the pseudo-code to compute the means and store them in a list is:

create an empty list
repeat 10,000 times:
 take a sample of size 50 from the drinks dataframe
 compute the mean number of calories for this sample
 add this mean to your list


Next plot the histogram of these means:

To better compare the two histograms, let's plot them on the same graph, using the `density = True` parameter for each histogram.
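
A sketch of such an overlaid plot is shown below, again on synthetic data. The helper `sample_means`, the dataframe `toy`, the bin count, and the reduced number of repetitions (2,000 instead of 10,000, to keep it quick) are all choices made for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, right-skewed stand-in for the calorie data (not the real values).
rs = np.random.RandomState(1)
toy = pd.DataFrame({"Calories": rs.gamma(shape=2.0, scale=60.0, size=200)})

def sample_means(df, column, n, reps, random_state):
    """Return `reps` means of random samples of size `n` from df[column]."""
    return [df.sample(n=n, random_state=random_state)[column].mean() for _ in range(reps)]

means_30 = sample_means(toy, "Calories", n=30, reps=2_000, random_state=rs)
means_50 = sample_means(toy, "Calories", n=50, reps=2_000, random_state=rs)

fig, ax = plt.subplots()
# density=True rescales each histogram to total area 1, so the shapes are comparable;
# alpha makes the bars semi-transparent so both histograms stay visible.
counts_30, edges_30, _ = ax.hist(means_30, bins=40, density=True, alpha=0.5, label="n = 30")
counts_50, edges_50, _ = ax.hist(means_50, bins=40, density=True, alpha=0.5, label="n = 50")
ax.set_xlabel("sample mean")
ax.set_ylabel("density")
ax.legend()
```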

What happens to the distribution as the sample size increases? How does the mean change? How does the standard deviation change? Does this make sense?

The Central Limit Theorem tells us that the mean of the sampling distribution should be close to the mean of the population. Let's test this by computing:

1. the mean of the population (the mean of the calorie column in the `drinks` dataframe)
2. the mean of the means of the samples of size 30
3. the mean of the means of the samples of size 50
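
The three quantities in this list could be computed along the following lines. This is a sketch on synthetic data with 2,000 repetitions per sample size; `toy`, `Calories`, and `mean_of_means` are names assumed for the example.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for the calorie data (not the real values).
rs = np.random.RandomState(2)
toy = pd.DataFrame({"Calories": rs.gamma(shape=2.0, scale=60.0, size=200)})

population_mean = toy["Calories"].mean()

mean_of_means = {}
for n in (30, 50):
    means = [toy.sample(n=n, random_state=rs)["Calories"].mean() for _ in range(2_000)]
    mean_of_means[n] = pd.Series(means).mean()

print("population mean:", round(population_mean, 2))
for n, value in mean_of_means.items():
    print(f"mean of sample means (n = {n}):", round(value, 2))
```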

The Central Limit Theorem also tells us that when the sample size is large:

$\text{SD of all possible sample means} = \frac{\text{population SD}}{\sqrt{\text{sample size}}}$ 

Let's test this for the size 30 samples. The left hand side of the equation is the standard deviation (SD) of these sample means:

The right hand side of the equation is the standard deviation of the population (calories of all drinks), divided by $\sqrt{\text{sample size}}$, which we can calculate with the code `np.sqrt(30)`.
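
Both sides of the equation can be computed as in the sketch below, which uses synthetic data and 5,000 repetitions. The names `toy`, `lhs`, and `rhs` are assumptions for the example. Two details are worth knowing: Pandas' `.std()` uses the sample formula (`ddof=1`) by default, so `ddof=0` is passed here to treat the column as the whole population, and when sampling without replacement from a small population the simulated value tends to come out slightly below the prediction.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for the calorie data (not the real values).
rs = np.random.RandomState(3)
toy = pd.DataFrame({"Calories": rs.gamma(shape=2.0, scale=60.0, size=200)})

n = 30
means_30 = pd.Series(
    [toy.sample(n=n, random_state=rs)["Calories"].mean() for _ in range(5_000)]
)

lhs = means_30.std()                            # SD of the simulated sample means
rhs = toy["Calories"].std(ddof=0) / np.sqrt(n)  # population SD / sqrt(sample size)

print("SD of sample means:     ", round(lhs, 2))
print("population SD / sqrt(n):", round(rhs, 2))
```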

How similar are the two numbers? When sampling with replacement, the two numbers get closer as the sample size increases. For example, with samples of size 1000 drawn with replacement, the two numbers should agree to about the first decimal place.

### Testing the Central Limit Theorem: Protein column

Let's repeat this experiment for another variable. Plot a histogram of the amount of protein in the drinks.

Now let's take 10,000 samples, each of size 1,000, and compute their means. Use the parameter `replace = True` to sample from the `drinks` dataframe with replacement.

The pseudo-code to compute the means and store them in a list is:

create an empty list
repeat 10,000 times:
 take a sample of size 1000 with replacement from the drinks dataframe
 compute the mean amount of protein for this sample
 add this mean to your list
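
The sketch below shows why `replace = True` is needed here: a sample of size 1,000 is larger than the (synthetic) population of 200 rows, so Pandas raises a `ValueError` unless rows may be drawn more than once. The dataframe `toy`, the column name `Protein`, and the reduced repetition count are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the protein data (not the real values).
rs = np.random.RandomState(4)
toy = pd.DataFrame({"Protein": rs.exponential(scale=6.0, size=200)})

# Without replacement, a sample cannot be larger than the population.
try:
    toy.sample(n=1000, random_state=rs)
    oversample_allowed = True
except ValueError:
    oversample_allowed = False

# With replace=True, rows may be drawn more than once, so n = 1000 works.
protein_means = []
for _ in range(500):  # the lab uses 10,000; fewer here to keep the sketch quick
    sample = toy.sample(n=1000, replace=True, random_state=rs)
    protein_means.append(sample["Protein"].mean())

print(oversample_allowed, len(protein_means))
```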


Plot a histogram of the means of the samples:

What does the distribution look like?

Let's compare the mean and standard deviation with what is predicted by the Central Limit Theorem.

To compare the means, compute:
1. the mean of your sample means
2. the mean of the population (protein column in the `drinks` dataframe)

How do the two means compare?

To compare the standard deviations, compute:
1. the standard deviation of your sample means
2. the standard deviation of the population, divided by the square root of the sample size (1000)
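
One way to line up the simulated and predicted values side by side is sketched below, on synthetic data with 2,000 repetitions. The names `toy`, `Protein`, and `comparison` are assumptions for the example, and `ddof=0` is used so that the column is treated as the whole population.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the protein data (not the real values).
rs = np.random.RandomState(5)
toy = pd.DataFrame({"Protein": rs.exponential(scale=6.0, size=200)})

n = 1000
protein_means = pd.Series(
    [toy.sample(n=n, replace=True, random_state=rs)["Protein"].mean() for _ in range(2_000)]
)

# Simulated values come from the sample means; predicted values from the CLT.
comparison = pd.DataFrame(
    {
        "simulated": [protein_means.mean(), protein_means.std()],
        "predicted": [toy["Protein"].mean(), toy["Protein"].std(ddof=0) / np.sqrt(n)],
    },
    index=["mean", "SD"],
)
print(comparison.round(3))
```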

How do the two standard deviations compare? 

### Challenges:
- What does the sampling distribution of the mean of the `Carb. (g)` column look like when the sample size is 30, 300, and 900? (If a sample size is larger than the number of rows, sample with replacement.)
- How do the mean of the sample means and the population mean compare for the `Carb. (g)` column when the sample size is 30, 300, and 900?
- How do the standard deviation of the sample means and the population standard deviation compare for the `Carb. (g)` column when the sample size is 30, 300, and 900?